Automatic DTD simplification by examples 

This paper describes a method for the automatic generation of simplified 
DTDs from a source DTD and a set of sample marked-up files. The purpose is 
to create the minimum DTD with which the sample set of files complies. In 
this way, new files can be created and parsed using the simplified DTD while 
still being compliant with the original, more general DTD. The simplified DTD 
can be used to make the task of markup easier, especially for inexperienced 
XML writers. 

The resulting tool was used at the Miguel de Cervantes digital library 
(http://cervantesvirtual.com/) to obtain simplified versions of the TEI.DTD 
(Sperberg-McQueen and Burnard, 1994). This work is part of a larger project 
in the field of text markup and derived applications (Bia and Pedreno, 2000). 

Motivation 

"Having standardized XML vocabularies for common things allows 
developers to reuse existing DTDs, saving the cost of developing custom 
DTDs. Custom DTDs isolate their users and applications from others that 
might otherwise be able to share commonly formatted documents and data. 
Shared DTDs are the foundation of XML data interchange and reuse" (Hunter, 
2000). 

Saving the cost of developing our own DTD and ensuring text 
interchangeability are some of the reasons why the teixlite.dtd (the XML 
version of the SGML teilite.dtd of the TEI encoding scheme) was chosen at the 
Cervantes digital library. However, TEIxlite is still too complex for markup 
beginners. Our markup team is composed mostly of humanists with some 
computer skills, who appreciate having their computer work simplified as 
much as possible. 

Moreover, our XML documents neither use nor need all the markup options 
provided by the teixlite.dtd. A simpler DTD was therefore needed to simplify 
markup tasks and to prevent the use of unwanted markup options. At the same 
time, we still wanted our files to be TEI compliant and to benefit from the 
advantages of sharing a common DTD with other international digitization 
projects. In brief, we needed a simpler, TEI-compliant DTD, that is, a 
valid subset of the teilite.dtd. 

We started by defining the kinds of modifications we would allow ourselves 
to make to the TEIlite DTD in order to make it simpler to use while keeping 
our documents TEI-compatible (with minor exceptions). To this end, we 
allowed the following changes: 

• To add normalized values to some attributes in order to force the use of 
fixed values instead of free data entry. 

• To add new attributes only in a few necessary cases (this is the only 
exception that may break TEI compatibility, but we considered that these 
added attributes can easily be removed whenever we want to comply with the 
TEI standard). 

• To make restrictions in element inclusion rules (we wanted to eliminate 
the possibility of including certain elements at certain levels of the 
markup). 

• To make some optional elements/attributes mandatory in order to enforce 
our specific markup norms. 

• To eliminate optional elements we will not use, in order to simplify the 
markup task and to avoid possible errors (basically, we wanted to eliminate 
the features we decided not to use). 

Clearly, performing these simplifications by hand is tedious and 
error-prone. Constructing a set of sample documents representative of all 
the types of documents we need to mark up, together with a program that 
simplifies the DTD automatically, alleviates this task. 

Previous work 

Document types are defined by extended context-free grammars in which the 
right-hand sides of productions are unambiguous regular expressions 
(Bruggemann, 1998). Previous work has addressed the task of identifying a 
DTD from examples. A common difficulty in this approach is the need to find 
the correct degree of generalization. Some practical tools, such as FRED 
(Shafer, 1995), let users customize their preferred degree of 
generalization. Ahonen builds a (k,h)-testable model (Ahonen, 1995; Ahonen, 
1997; Ahonen, Mannila, and Nikunen, 1997). 

Young-Lai and Tompa (Young-Lai and Tompa, 2000) rely on a stochastic 
approach to control overgeneralization, based in turn on the algorithm by 
Carrasco and Oncina (Carrasco, 1998). Presumably, the stochastic approach 
needs large collections of hand-tagged documents. 

Pizza-Chef (Burnard, 1997) is a tool to generate TEI-compliant DTDs suited 
to a particular task. In this case, only predefined tasks and TEI DTDs are 
allowed. 

Objectives 

However, a general DTD that defines a global frame which a whole set of 
files must satisfy offers a natural way to avoid overgeneralization. In this 
sense, any particularized, narrow-scope DTD should not accept any document 
that is not accepted by the general, wide-scope DTD. 

Therefore, the objective of our approach is to automatically select the 
DTD features that are used by a set of valid documents (validated against the 
more general DTD) and eliminate the rest of them, obtaining a narrow-scope 
DTD which defines a subset of the original markup scheme. This "pruned" 
DTD can be used to build new documents of the same markup subclass, which 
in turn would still comply with the original general DTD. Needless to say, 
working with a simpler DTD is easier. 

General description 

For the implementation of the DTDprune toolkit we needed both an XML parser 
and a DTD parser. We assumed that both the XML sample files and the source 
DTD would be well-formed and valid, so there was no need to build 
validating parsers. Instead, we developed two simple parsers based on the 
XML BNF grammar described by Harold (1999). A diagram of the process is 
shown in the figure below. 
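
As a rough illustration of what such non-validating preprocessing might 
look like, the following Python sketch extracts element declarations from a 
DTD and the child sequences observed in an XML sample. It is not the 
DTDprune code: the regular expression, the function names 
`element_declarations` and `child_sequences`, and the toy DTD are 
assumptions made for this example, and it presumes a DTD without parameter 
entities, conditional sections, or comments that contain declarations. 

```python
import re
import xml.etree.ElementTree as ET

# Matches <!ELEMENT name content-model>. Assumption: the content model
# contains no '>' and no parameter-entity references.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+([^>]+)>")


def element_declarations(dtd_text):
    """Map each declared element name to its raw content-model string."""
    return {name: " ".join(model.split())
            for name, model in ELEMENT_DECL.findall(dtd_text)}


def child_sequences(xml_text):
    """Map each element name to the set of child-name sequences seen."""
    seen = {}
    for node in ET.fromstring(xml_text).iter():
        children = tuple(child.tag for child in node)
        seen.setdefault(node.tag, set()).add(children)
    return seen


# Toy DTD fragment and sample, invented for illustration only.
dtd = """
<!ELEMENT text (front?, body, back?)>
<!ELEMENT body (head?, (p | lg)+)>
<!ELEMENT p (#PCDATA)>
"""
decls = element_declarations(dtd)
seqs = child_sequences("<text><body><p>a</p><p>b</p></body></text>")
```

Here `decls` holds the content model of each declared element, and `seqs` 
records, for every element name, which sequences of children actually occur 
in the sample. 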



[Figure: Architecture of the DTD simplifier. Labels recoverable from the 
diagram: "XML sample files", "Build Glushkov automata".] 



As the diagram shows, the general DTD is processed to extract the structure 
of the markup model, from which we build Glushkov automata (Caron and 
Ziadi, 2000). The XML sample files are preprocessed to extract the elements 
used and their nesting patterns. Based on the Glushkov automata that 
represent the regular expressions defining the possible element contents 
according to the general DTD, we keep track of the elements used in the 
sample files and mark the visited states of the automata. Finally, a 
simplification process takes place. This process eliminates unused elements 
and simplifies the right parts of element definitions, i.e. the regular 
expressions that define further nestings. The simplified DTD structure is 
used to generate the new simplified DTD. 
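
The marking and pruning steps can be sketched in Python as follows. This is 
a minimal illustration under stated assumptions, not the DTDprune 
implementation: the tuple-based representation of content models, the 
function names `glushkov`, `mark_visited` and `prune`, and the toy `body` 
model are all hypothetical. The sketch also assumes deterministic content 
models (as XML requires), so at most one automaton position matches the 
next child element. 

```python
from itertools import count


def glushkov(model):
    """Build a position (Glushkov) automaton for a content-model tree.

    Nodes: ("sym", name), ("seq", a, b), ("alt", a, b),
    ("opt", a), ("star", a), ("plus", a).
    Returns (symbols, first, last, follow, nullable).
    """
    symbols, follow = [], {}

    def walk(node):
        kind = node[0]
        if kind == "sym":
            pos = len(symbols)
            symbols.append(node[1])
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind in ("seq", "alt"):
            n1, f1, l1 = walk(node[1])
            n2, f2, l2 = walk(node[2])
            if kind == "alt":
                return n1 or n2, f1 | f2, l1 | l2
            for p in l1:                  # end of a may be followed by start of b
                follow[p] |= f2
            return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)
        n, f, l = walk(node[1])           # "opt", "star" or "plus"
        if kind != "opt":
            for p in l:                   # repetition loops back to the start
                follow[p] |= f
        return (n or kind != "plus"), f, l

    nullable, first, last = walk(model)
    return symbols, first, last, follow, nullable


def mark_visited(automaton, children, visited):
    """Run one child sequence; record the positions used if it is accepted."""
    symbols, first, last, follow, nullable = automaton
    candidates, path = first, []
    for name in children:
        hits = [p for p in candidates if symbols[p] == name]
        if not hits:
            return False                  # sequence not allowed by the model
        path.append(hits[0])
        candidates = follow[hits[0]]
    accepted = (path[-1] in last) if path else nullable
    if accepted:
        visited.update(path)
    return accepted


def prune(node, visited, positions):
    """Return (pruned node or None, nullability of the original node)."""
    kind = node[0]
    if kind == "sym":
        return (node if next(positions) in visited else None), False
    if kind in ("seq", "alt"):
        a, na = prune(node[1], visited, positions)
        b, nb = prune(node[2], visited, positions)
        nullable = (na and nb) if kind == "seq" else (na or nb)
        if a is None or b is None:
            kept, kept_nullable = (a, na) if b is None else (b, nb)
            if kept is not None and nullable and not kept_nullable:
                kept = ("opt", kept)      # keep empty content allowed
            return kept, nullable
        return (kind, a, b), nullable
    child, nc = prune(node[1], visited, positions)
    return (None if child is None else (kind, child)), (nc or kind != "plus")


# Toy model for (head?, (p | lg)+) and two observed child sequences.
body = ("seq", ("opt", ("sym", "head")),
        ("plus", ("alt", ("sym", "p"), ("sym", "lg"))))
automaton = glushkov(body)
visited = set()
for children in [("p", "p"), ("head", "p")]:
    mark_visited(automaton, children, visited)
pruned, _ = prune(body, visited, count())
# pruned corresponds to the content model (head?, p+)
```

Because the unused `lg` alternative is only ever removed, never replaced by 
something new, every sequence accepted by the pruned model is also accepted 
by the original one, which is the subset property the approach relies on. 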

Conclusions 

Using this automated method, the simplified DTD can be updated 
immediately whenever new features are added to (or eliminated from) the 
sample set of XML files (modifications to files of the sample set must be 
validated against the general DTD). This process can be repeated to 
incrementally produce a final narrow-scope DTD. In this way, we use a 
complex DTD as a general markup-design frame to build a simpler working DTD 
that suits a specific project's markup needs. 

Another use of this technique is to build a one-document DTD, i.e. the 
minimum DTD derived from the general DTD with which a given XML document 
complies. 

A further benefit of this technique is that we can produce statistics that 
may help markup designers improve their markup schemes. Information about 
the frequency of use of certain elements within others helps us detect 
unusual structures that could reflect markup mistakes, misuse of the DTD, or 
DTD features that may allow unwanted generalization. This statistical data 
on the use of markup may help us make decisions about adding new markup 
constraints or, on the contrary, expanding the simplified DTD. 
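
One simple way such nesting statistics might be gathered is sketched below 
in Python. The function name `nesting_counts`, the sample documents, and 
the "seen only once" threshold are assumptions chosen for illustration; the 
paper does not specify how its statistics are computed. 

```python
from collections import Counter
import xml.etree.ElementTree as ET


def nesting_counts(xml_documents):
    """Count (parent, child) element pairs over a collection of XML strings."""
    counts = Counter()
    for text in xml_documents:
        for parent in ET.fromstring(text).iter():
            counts.update((parent.tag, child.tag) for child in parent)
    return counts


# Invented sample documents, for illustration only.
samples = [
    "<body><div><p/><p/><lg><l/></lg></div></body>",
    "<body><div><p/><p><lg><l/></lg></p></div></body>",
]
counts = nesting_counts(samples)

# Pairs seen only once are candidates for manual review.
rare = sorted(pair for pair, n in counts.items() if n == 1)
```

In this toy sample, `lg` appears once directly inside `div` and once inside 
`p`, so both pairs are reported as rare and could be checked for markup 
mistakes or for features that should be constrained. 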